Business Analytics

Data Visualization

Ayush Patel

10 January, 2024

Pre-requisite

You already…

  • Know basics of data wrangling
  • Understand different data types
  • Understand different types of objects

Before we begin

Please install and load the following packages

library(tidyverse)
library(ggplot2) ## is this command really needed?



Access the lecture slides from the course landing page

About Me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am an RStudio (Posit) certified tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

The ggplot2 Package

Content for this topic has been sourced from Winston Chang’s ‘R Graphics Cookbook, 2nd edition’. Please check out his work for detailed information.

  • ggplot2 takes a different approach to graphics than other plotting packages in R
  • Gets its name from Leland Wilkinson’s grammar of graphics
  • Grammar of graphics - provides a formal, structured perspective on how to describe data graphics

ggplot2 Terminology

Content for this topic has been sourced from Winston Chang’s ‘R Graphics Cookbook, 2nd edition’. Please check out his work for detailed information.

Some of the terminologies used in ggplot2:

  • Data - what we want to visualize; it consists of variables
  • Geoms - geometric objects that are drawn to represent the data, such as bars, lines, and points
  • Aesthetics - visual properties of geoms, such as x and y position, line color, point shapes, etc.
  • Mappings - connect data values to aesthetics

Building a plot

Effective design should start with a visual task analysis, determine the set of visual queries to be supported by a design, and then use color, form, and space to efficiently serve those queries. - Colin Ware

We will use the mpg and diamonds datasets for learning data visualization. You can run ?mpg and ?diamonds to understand the variables in the data.

Components of a plot - ggplot()

An example - Plotting City Miles by Fuel Type

ggplot(data = mpg)  # the plot area and data

Components of a plot - geom() & aes()

ggplot(data = mpg) + # the plot area and data
  geom_boxplot(
    aes(fl, cty),
    fill = "steelblue",
    alpha = 0.5
    )  # geom and aesthetic

Components of a plot - layers

ggplot(data = mpg) + # the plot area and data
  geom_boxplot(
    aes(fl, cty),
    alpha = 0.5,
    fill = "steelblue"
    )+  # geom and aesthetic
  geom_jitter(
    aes(fl, cty),
    alpha = 0.5,
    colour = "steelblue"
    )  # another layer with aesthetics

Components of a plot - theme()

ggplot(data = mpg) + # the plot area and data
  geom_boxplot(
    aes(fl, cty),
    alpha = 0.5,
    fill = "steelblue"
    )+  # geom and aesthetic
  geom_jitter(
    aes(fl, cty),
    alpha = 0.5,
    colour = "steelblue"
    )+  # another layer with aesthetics
  theme_bw()  # theme

Components of a plot - labs()

ggplot(data = mpg) + # the plot area and data
  geom_boxplot(
    aes(fl, cty),
    alpha = 0.5,
    fill = "steelblue"
    )+  # geom and aesthetic
  geom_jitter(
    aes(fl, cty),
    alpha = 0.5,
    colour = "steelblue"
    )+  # another layer with aesthetics
  theme_bw()+  # theme
  labs(
    x = "Fuel Type",
    y = "City Miles (per gallon)",
    title = "City Miles (per gallon) of Cars by Fuel Type"
  ) # labels

Plotting a continuous variable

  • Histogram of displacement
  • Useful for seeing distribution of a variable
ggplot(data = mpg)+ #specifying data
  geom_histogram(
     aes(x = displ)
     ) #geom and aesthetic
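
The shape of a histogram depends on how wide the bins are. geom_histogram() defaults to 30 bins and prints a message suggesting that you pick your own value. A minimal sketch, where the binwidth of 0.5 litres is only an illustrative choice:

```r
library(ggplot2)

# Same histogram of displacement, but with an explicit bin width
# (0.5 is an assumed, illustrative value; try others and compare)
p <- ggplot(data = mpg) +
  geom_histogram(
    aes(x = displ),
    binwidth = 0.5
  )

p  # printing the object draws the plot
```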

  • Density Plot of Displacement
  • A smoother version of the histogram
ggplot(data = mpg)+ #specifying data
  geom_density(
    aes(x = displ)
    ) #geom and aesthetic

Do it Yourself - 1

  • Load the yrbss_samp dataset from openintro package
  • Make a histogram for the height variable
  • Make a density plot to see the distribution of the weight variable
  • What can you infer from the density plot of age?
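
To get started, one possible way to load and inspect the data is sketched here. It assumes the openintro package is already installed; glimpse() comes from dplyr, which is loaded with the tidyverse.

```r
# Assumes openintro has been installed, e.g. with install.packages("openintro")
library(openintro)
library(dplyr)

# yrbss_samp becomes available once openintro is loaded
glimpse(yrbss_samp)  # lists the variables and their types
```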

Plotting a discrete variable

  • Bar chart - useful for showing the frequency of each category of a discrete variable
  • Plotting the type of drive train
ggplot(data = mpg)+
  geom_bar(
    aes(x = drv)
    )

Do it Yourself - 2

  • Using the yrbss_samp data, plot a bar chart to see the distribution of males and females
  • Similarly, what is the distribution of Hispanic and non-Hispanic respondents in the sample? Show it using a data visualization

Plotting two continuous variables

  • Scatterplot - useful for seeing the relationship of one variable with another
  • Plotting displacement against city miles
ggplot(data = mpg)+
  geom_point(
    aes(x = displ,y = cty)
    ) #adding  a layer of points

Plotting one continuous and one discrete variable

Plotting mean displacement by type of the car

mpg %>%
  group_by(class) %>%
  summarise(
    mean_displacement = mean(displ)
    ) %>% # data wrangling
  ggplot() +
  geom_col(
    aes(class, mean_displacement)
    ) #adding a column

Do it Yourself - 3

  • Plot a scatterplot of height and weight
  • Make a bar chart to show the difference in mean height between male and female respondents
  • Show visually the difference in the mean number of days that males and females do strength training
  • Filter the data for only males, and then make a bar chart to see whether average height differs between Hispanic and non-Hispanic respondents

A graphing template

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science’. Please check out his work for detailed information.

  • Now that you have looked at some graphs, a general graphing template can be written as:
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

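As a minimal sketch, the template can be filled in with the mpg data used throughout these slides. The choice of geom_point() and of the displ and hwy variables is only an illustration.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y positions
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p  # printing the object draws the plot
```
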
Aesthetics

  • Often, you do not just work with two variables
  • There are additional variables that you want to see along with the variables on the two axes
  • Aesthetics are used to show an additional variable visually

Aesthetics - Colour

ggplot(data = mpg)+
  geom_point(
    aes(x = displ,y = cty, 
        colour = drv)
    ) # shows the different type of drive train through colours

Aesthetics - Shape

ggplot(data = mpg)+
  geom_point(
    aes(x = displ,y = cty, 
        shape = drv)
    ) # shows the different type of drive train through shapes

Aesthetics - Alpha

ggplot(data = mpg)+
  geom_point(
    aes(x = displ,y = cty, 
        shape = drv), 
    alpha = 0.5) #sets the opacity
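
Note that alpha = 0.5 above sits outside aes(): it is a fixed value applied to every point, not a mapping from a variable. A short sketch contrasting the two cases (the "steelblue" colour is just an illustrative choice):

```r
library(ggplot2)

# Mapping: colour varies with a variable, so it goes inside aes()
# and ggplot2 adds a legend for it
p_mapped <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = cty, colour = drv))

# Setting: one fixed value for all points, so it goes outside aes()
# and no legend is drawn
p_set <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = cty), colour = "steelblue", alpha = 0.5)
```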

Aesthetics - Size

ggplot(data = mpg)+
  geom_point(
    aes(x = displ,y = cty, 
        size = drv), 
    alpha = 0.5) # ggplot2 warns that size is not advised for a discrete variable

Do it Yourself - 4

  • To the scatterplot of height and weight made earlier, add gender as the colour aesthetic
  • To the previous plot, instead of gender as the colour, add gender as the shape aesthetic
  • To the second plot, set alpha first to 0.2, then to 0.5 and then to 0.8. What is the difference?
  • In the first plot, replace colour with size as the aesthetic

Facet Wrap

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science’. Please check out his work for detailed information.

  • Faceting splits a plot into smaller panels, one for each category of a variable
  • It is a way to see each category individually
  • facet_wrap() should be given a discrete variable
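
A minimal sketch of faceting, reusing the displacement and city miles scatterplot from earlier and splitting it by drive train (the choice of drv as the faceting variable is only an illustration):

```r
library(ggplot2)

# One scatterplot panel per type of drive train (drv is discrete)
p <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = cty)) +
  facet_wrap(~ drv)

p  # printing the object draws the plot
```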

Cleaning the plot

mpg %>%
  group_by(class) %>%
  summarise(mean_displacement = mean(displ)) %>%
  ggplot(
    aes(x = reorder(class, mean_displacement), 
        y = mean_displacement)
    ) +
  geom_col()

mpg %>%
  group_by(class) %>%
  summarise(mean_displacement = mean(displ)) %>%
  ggplot(aes(x = reorder(class, mean_displacement), y = mean_displacement)) +
  geom_col()+
  coord_flip() #flips the axes

mpg %>%
  group_by(class) %>%
  summarise(mean_displacement = mean(displ)) %>%
  ggplot(aes(x = reorder(class, mean_displacement), y = mean_displacement)) +
  geom_col(width = .65, fill = "#118B60") +
  coord_flip()

mpg %>%
  group_by(class) %>%
  summarise(mean_displacement = mean(displ)) %>%
  ggplot(aes(x = reorder(class, mean_displacement), y = mean_displacement)) +
  geom_col(width = .65, fill = "#118B60")+
  coord_flip()+
  labs(title = "2 Seater Has the Highest Displacement",
       subtitle = "Mean Displacement (in litres) vs Class of the Model",
       y = "Mean Displacement",
       x = "Type of the car",
       caption = "Data Source : mpg | Analysis by Student")+
  theme_bw()

Do it Yourself - 5

  • Convert the strength training variable into categories, with a value of 0 as ‘no training’, 1-3 as ‘low training’, 3-5 as ‘moderate training’ and more than 5 as ‘high training’
  • Make a bar plot of the mean weight of people in these categories
  • Reorder the bars in descending order and reduce their width
  • Now, flip the axis of the previous plot
  • Format the graph properly, with proper labels, captions, data source and colours
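
The slides so far have not shown how to turn a numeric variable into categories. One possible approach, sketched here on the mpg data rather than on the exercise data, is dplyr::case_when(). The cty_group name, the cut-offs and the labels below are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Conditions are checked in order; the first one that is TRUE wins
mpg_binned <- mpg %>%
  mutate(
    cty_group = case_when(
      cty < 15 ~ "low",
      cty < 20 ~ "medium",
      TRUE     ~ "high"   # everything else
    )
  )

mpg_binned %>% count(cty_group)  # how many cars fall in each category
```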

Bar Charts

mpg %>%
  group_by(fl, drv) %>%
  summarise(mean_displacement = mean(displ)) %>%
  ggplot(aes(x = fl, y = mean_displacement, fill = drv)) +
  geom_bar(position="dodge", stat="identity")+ 
  scale_fill_brewer(palette = "Pastel1")+
  theme_bw()

mpg %>%
  group_by(fl, drv) %>%
  summarise(mean_displacement = mean(displ)) %>%
  ggplot(aes(x = fl, y = mean_displacement, fill = drv)) +
  geom_bar(position="stack", stat="identity")+
  scale_fill_brewer(palette = "Pastel1")+
  theme_bw()

Bar Graph with Counts

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science’. Please check out his work for detailed information.

By default, geom_bar() uses stat_count(), which takes the count of each category as the y variable

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut)) #to count and plot
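
To see what the counting step does, the sketch below draws the same bars in two ways: with geom_bar(), which counts for you, and by counting by hand with dplyr::count() and plotting the result with geom_col(). The object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows in each category of cut behind the scenes
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Counting by hand; count() stores the counts in a column called n
cut_counts <- diamonds %>% count(cut)

# geom_col() plots the values as given, so y must be supplied
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
```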

Line Chart

Useful for showing a trend over time. Using the tourism data from openintro, we draw a line chart of the number of visitors over the years.

tourism <- openintro::tourism
tourism %>%
  ggplot(aes(x= year, y = visitor_count_tho)) +
  geom_line(group = 1, color = '#E54B4B', lwd = 1) +
  geom_point() +
  ggtitle("Increase in Number of Tourists in Turkey") +
  xlab("Year")+
  ylab("Number of Visitors") +
  theme_bw()

Area Chart

A modification to the line chart

tourism <- openintro::tourism
tourism %>%
  ggplot(aes(x= year, y = visitor_count_tho)) +
  geom_area( fill="#69b3a2", alpha=0.4) + #to get the area below the graph
  geom_line(group = 1, color = '#E54B4B', lwd = 1) +
  ggtitle("Increase in Number of Tourists in Turkey") +
  xlab("Year")+
  ylab("Number of Visitors") +
  theme_bw()

Do it Yourself - 6

  • Convert the previous plot of training and weight into a grouped bar chart, with gender as the grouping variable
  • Alternatively, make it a stacked bar chart
  • Use the unempl dataset and make a line chart of rate of unemployment over the years. Make sure that you properly format the graph as required
  • Convert the previous line graph into an area chart

Thank you :)